ABSTRACT

$\epsilon$-differential privacy is the state-of-the-art model for releasing sensitive information while protecting privacy. Numerous methods have been proposed to enforce $\epsilon$-differential privacy in various analytical tasks, e.g., regression analysis. Existing solutions for regression analysis, however, are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce $\epsilon$-differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to the two most widely used regression models, namely, linear regression and logistic regression. Both theoretical analysis and thorough experimental evaluations show that the functional mechanism is highly effective and efficient, and that it significantly outperforms existing solutions.

1. INTRODUCTION 

Releasing sensitive data while protecting privacy has been a subject of active research for the past few decades. One state-of-the-art approach to the problem is $\epsilon$-differential privacy, which works by injecting random noise into the released statistical results computed from the underlying sensitive data, such that the distribution of the noisy results is relatively insensitive to any change of a single record in the original dataset. This ensures that the adversary cannot infer any information about any particular record with high confidence (controlled by parameter $\epsilon$), even if he/she possesses all the remaining tuples of the sensitive data. Meanwhile, the noisy results should be close to the unperturbed ones in order to be useful in practice. Hence, the goal of an $\epsilon$-differentially private data publication mechanism is to maximize result accuracy, while satisfying the privacy guarantees.

The best strategy to enforce $\epsilon$-differential privacy depends upon the nature of the statistical analysis that will be performed using the noisy data. This paper focuses on regression analysis, which identifies the correlations between different attributes based on the input data. Figure 1 illustrates the two most commonly used types of regression, namely, linear regression and logistic regression. Specifically, linear regression finds the linear relationship between the input attributes that best fits the input data. In the example shown in Figure 1a, there are two attributes, age and medical expenses; the data records are shown as dots. The regression result is a straight line with minimum overall distance to the data points, which expresses the value of one attribute as a linear function of the other one. Figure 1b shows an example of logistic regression, where there are two classes of data: diabetes patients (shown as black dots) and those without diabetes (white dots). The goal is to predict the probability of having diabetes, given a patient's other attributes (i.e., age and cholesterol level in our example). The result of this logistic regression can be expressed as a straight line; the probability of a patient having diabetes is calculated based on which side of the line the patient lies on, and the patient's distance to the line. In particular, if a patient's age and cholesterol level correspond to a point that falls exactly on the straight line, then his/her probability of having diabetes is predicted to be 50%. We present the mathematical details of these two types of regression in Section 3.

Figure 1: Two examples of regression problems

Although regression is a very common type of analysis in practice (especially on medical data), so far there is only a narrow selection of methods for $\epsilon$-differentially private regression. The main challenge lies in the fact that regression involves solving an optimization problem. The relationship between the optimization results and the original data is difficult to analyze; consequently, it is hard to decide on the minimum amount of noise necessary to make the optimization results differentially private. Most existing solutions for $\epsilon$-differential privacy are designed for releasing simple aggregates (e.g., counts), or structures that can be decomposed into such aggregates, e.g., trees or histograms. One way to adapt these solutions to regression analysis is through synthetic data generation (e.g., [7]), which generates synthetic data in a differentially private way based on the original sensitive data. The resulting synthetic dataset can be used for any subsequent analysis. However, due to its generic nature, this methodology often injects an unnecessarily large amount of noise, as shown in our experiments. To our knowledge, the only known solutions that target regression are [4, 5, 16, 27], which, however, are either limited to non-standard types of regression analysis or unable to produce accurate regression results, as will be shown in Sections 2 and 7.

Motivated by this, we propose the functional mechanism, a general framework for enforcing $\epsilon$-differential privacy on analyses that involve solving an optimization problem. The main idea is to enforce $\epsilon$-differential privacy by perturbing the objective function of the optimization problem, rather than its results. Publishing the results of the perturbed optimization problem then naturally satisfies $\epsilon$-differential privacy as well. Note that, unlike previous work [4, 5], which relies on some special properties of the objective function, our functional mechanism generally applies to all forms of optimization functions. Perturbing objective functions is inherently more challenging than perturbing scalar aggregate values, for two reasons. First, injecting noise into a function is non-trivial; as we show in the paper, simply adding noise to each coefficient of a function often leads to unbearably high noise levels, which in turn leads to nearly useless results. Second, not all noisy functions are valid objective functions; in particular, some noisy functions lead to unbounded results, and some others have multiple local minima. The proposed functional mechanism solves these problems through a set of novel and non-trivial algorithms that perform random perturbations in the functional space. As case studies, we apply the functional mechanism to both linear and logistic regressions. We prove that for both types of regression, the noise scale required by the proposed methods is constant with respect to the cardinality of the training set. Extensive experiments using real data demonstrate that the functional mechanism achieves highly accurate regression results with prediction power comparable to the unperturbed results, and that it significantly outperforms the existing solutions.

The remainder of the paper is organized as follows. Section 2 reviews related studies of differential privacy. Section 3 formally defines our problem. Section 4 describes the basic framework for the functional mechanism, and applies it to enforce $\epsilon$-differential privacy on linear regression. Section 5 extends the mechanism to handle more complex objective functions, and solves the problem of differentially private logistic regression. Section 6 presents a post-processing module to ensure that the perturbed objective function has a unique optimal solution. Section 7 contains an extensive set of experimental evaluations. Finally, Section 8 concludes the paper with directions for future work.

2. RELATED WORK 

Dwork et al. [9] propose $\epsilon$-differential privacy and show that it can be enforced using the Laplace mechanism, which supports any queries whose outputs are real numbers (see Section 3 for details). This mechanism is widely adopted in the existing work, but most adoptions are restricted to aggregate queries (e.g., counts) or queries that can be reduced to simple aggregates. In particular, Hay et al. [13], Li et al. [17], Xiao et al. [30], and Cormode et al. [6] present methods for minimizing the worst-case error of a given set of count queries; Barak et al. [2] and Ding et al. [8] consider the publication of data cubes; Xu et al. [31] and Li et al. [18] focus on publishing histograms; McSherry and Mironov [20], Rastogi and Nath [24], and McSherry and Mahajan [19] devise methods for releasing counts on particular types of data, such as time series.

Complementary to the Laplace mechanism, McSherry and Talwar [21] propose the exponential mechanism, which works for any queries whose output spaces are discrete. This enables differentially private solutions for various interesting problems where the outputs are not real numbers. For instance, the exponential mechanism has been applied to the publication of auction results [21], coresets [10], frequent patterns [3], decision trees [11], support vector machines [25], and synthetic datasets [7, 16]. Nevertheless, neither the Laplace mechanism nor the exponential mechanism can be easily adopted for regression analysis. The reason is that both mechanisms require a careful sensitivity analysis of the target problem, i.e., an analysis of how much the problem output would change when an arbitrary tuple in the input data is modified. Unfortunately, such an analysis is rather difficult for regression tasks, due to the complex correlations between regression inputs and outputs.

To the best of our knowledge, the only existing work that targets regression analysis is by Chaudhuri et al. [4, 5], Smith [27], and Lei [16]. Specifically, Chaudhuri et al. [4, 5] show that, when the cost function of a regression task is convex and doubly differentiable, the regression can be performed with a differentially private algorithm based on objective perturbation. The algorithm, however, is inapplicable to standard logistic regression, as the cost function of logistic regression does not satisfy the convexity requirement. Instead, Chaudhuri et al. demonstrate that their algorithm can address a non-standard type of logistic regression with a modified input (see Section 3 for details). Nevertheless, it is unclear whether the modified logistic regression is useful in practice. Smith [27] proposes a general framework for statistical analysis that utilizes both the Laplace mechanism and the exponential mechanism. However, the framework requires that the output space of the statistical analysis be bounded, which renders it inapplicable to both linear and logistic regressions. For example, if we perform a linear regression on a three-dimensional dataset, the output would be two real numbers (i.e., the slopes of the regression plane on two different dimensions), both of which have an unbounded domain $(-\infty, +\infty)$ (see Section 3 for the details of linear regression).

Lei [16] proposes a regression method that avoids conducting sensitivity analysis directly on the regression outputs. In a nutshell, the method first employs the Laplace mechanism to produce a noisy multi-dimensional histogram of the input data. After that, it produces a synthetic dataset that matches the statistics in the noisy histogram, without looking at the original dataset. Finally, it utilizes the synthetic data to compute the regression results. Observe that the privacy guarantee of this method is solely decided by the procedure that generates the noisy histogram: the subsequent parts of the algorithm only rely on the histogram (instead of the original data), and hence, they do not reveal any information about the input dataset (except for the information revealed by the noisy histogram). This makes it much easier to enforce $\epsilon$-differential privacy, since the multi-dimensional histogram consists of only counts, which can be processed with the Laplace mechanism in a differentially private manner. Nevertheless, as will be shown in our experiments (where it is referred to as DPME), Lei's method [16] is restricted to datasets with small dimensionality. This is because, when the dimensionality of the input data increases, the method generates a noisy histogram with a coarser granularity, which in turn leads to inaccurate synthetic data and regression results. In summary, none of the existing solutions produce satisfactory results for linear or logistic regressions.

Finally, it is worth mentioning that there exists a relaxed version of $\epsilon$-differential privacy called $(\epsilon, \delta)$-differential privacy [23]. Under this privacy notion, a randomized algorithm is considered privacy preserving if it achieves $\epsilon$-differential privacy with a high probability (decided by $\delta$). This relaxed notion is useful in scenarios where $\epsilon$-differential privacy is too strict to allow any meaningful results to be released (see [12, 15] for examples). As we will show in this paper, however, linear and logistic regressions can be conducted effectively under $\epsilon$-differential privacy, i.e., we do not need to resort to $(\epsilon, \delta)$-differential privacy to achieve meaningful regression results.

3. PRELIMINARIES 

Let $D$ be a database that contains $n$ tuples $t_1, t_2, \ldots, t_n$ and $d + 1$ attributes $X_1, X_2, \ldots, X_d, Y$. For each tuple $t_i = (x_{i1}, x_{i2}, \ldots, x_{id}, y_i)$, we assume without loss of generality¹ that $\sqrt{\sum_{j=1}^{d} x_{ij}^2} \le 1$. Our objective is to construct a regression model from $D$ that enables us to predict any tuple's value on $Y$ based on its values on $X_1, X_2, \ldots, X_d$, i.e., we aim to obtain a function $\rho$ that (i) takes $(x_{i1}, x_{i2}, \ldots, x_{id})$ as input and (ii) outputs a prediction of $y_i$ that is as accurate as possible.

Depending on the nature of the regression model, the function $\rho$ can be of various types, and it is always parameterized with a vector $\omega$ of real numbers. For example, for linear regression, $\rho$ is a linear function of $x_{i1}, x_{i2}, \ldots, x_{id}$, and the model parameter $\omega$ is a $d$-dimensional vector where the $j$-th ($j \in \{1, \ldots, d\}$) number equals the weight of $x_{ij}$ in the function. To evaluate whether $\omega$ leads to an accurate model, we have a cost function $f$ that (i) takes $t_i$ and $\omega$ as input and (ii) outputs a score that measures the difference between the original and predicted values of $y_i$ given $\omega$ as the model parameters. The optimal model parameter $\omega^*$ is defined as:

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} f(t_i, \omega).$$

Without loss of generality, we assume that $\omega$ contains $d$ values $\omega_1, \ldots, \omega_d$. In addition, we consider that $f(t_i, \omega)$ can be written as a function of $\omega_k$ ($k \in \{1, \ldots, d\}$) given $t_i$, as is the case for most regression tasks.

We focus on two commonly used regression models, namely, linear regression and logistic regression, as defined in the following. For convenience, we abuse notation and use $x_i$ ($i \in \{1, \ldots, n\}$) to denote $(x_{i1}, x_{i2}, \ldots, x_{id})$, and we use $(x_i, y_i)$ to denote $t_i$.

Definition 1 (Linear Regression). Assume without loss of generality that the attribute $Y$ in $D$ has a domain $[-1, 1]$. A linear regression on $D$ returns a prediction function $\rho(x_i) = x_i^T \omega^*$, where $\omega^*$ is a vector of $d$ real numbers that minimizes a cost function $f(t_i, \omega) = (y_i - x_i^T \omega)^2$, i.e.,

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} (y_i - x_i^T \omega)^2.$$

In other words, linear regression expresses the value of $Y$ as a linear function of the values of $X_1, \ldots, X_d$, such that the sum of squared errors of the predicted $Y$ values is minimized².



¹ This assumption can be easily enforced by changing each $x_{ij}$ to $\frac{x_{ij} - \alpha_j}{(\beta_j - \alpha_j) \cdot \sqrt{d}}$, where $\alpha_j$ and $\beta_j$ denote the minimum and maximum values in the domain of $X_j$.



Definition 2 (Logistic Regression). Assume that the attribute $Y$ in $D$ has a boolean domain $\{0, 1\}$. A logistic regression on $D$ returns a prediction function, which predicts $y_i = 1$ with probability

$$\rho(x_i) = \exp(x_i^T \omega^*) / (1 + \exp(x_i^T \omega^*)),$$

where $\omega^*$ is a vector of $d$ real numbers that minimizes a cost function $f(t_i, \omega) = \log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega$. That is,

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} \left( \log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega \right).$$
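The two cost functions and the logistic prediction function can be written down directly. The following is a minimal sketch, not part of the original text: the helper names (`dot`, `linear_cost`, `logistic_cost`, `predict_probability`) are illustrative assumptions, and the inputs are assumed to be already normalized as described above.

```python
import math

def dot(x, omega):
    """Inner product x^T omega for two equal-length sequences."""
    return sum(xj * wj for xj, wj in zip(x, omega))

def linear_cost(x, y, omega):
    """Cost of Definition 1: (y - x^T omega)^2."""
    return (y - dot(x, omega)) ** 2

def logistic_cost(x, y, omega):
    """Cost of Definition 2: log(1 + exp(x^T omega)) - y * x^T omega."""
    z = dot(x, omega)
    return math.log1p(math.exp(z)) - y * z

def predict_probability(x, omega):
    """exp(z) / (1 + exp(z)) with z = x^T omega, in the equivalent sigmoid form."""
    return 1.0 / (1.0 + math.exp(-dot(x, omega)))

# A point with x^T omega = 0 lies on the decision line, so the predicted
# probability is one half, matching the 50% case described in Section 1.
p_on_line = predict_probability([0.3, 0.4], [0.0, 0.0])
```

The sketch makes no attempt to guard against overflow in `math.exp` for very large $|x_i^T \omega|$.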

For example, assume that $D$ contains three attributes $X_1$, $X_2$, and $Y$, such that $X_1$ (resp. $X_2$) represents a person's age (resp. body mass index), and $Y$ indicates whether or not the person has diabetes. In that case, a logistic regression on the database would return a function that maps a person's age and body mass index to the probability that he/she would have diabetes, i.e., $\rho(x_i) = \Pr[y_i = 1]$. This formulation of logistic regression is used extensively in the medical and social science fields to predict whether a certain event will occur given some observed variables. In [4, 5], Chaudhuri et al. consider a non-standard type of logistic regression with modified inputs. In particular, they assume that for each tuple $t_i$, its value on $Y$ is not a boolean value that indicates whether $t_i$ satisfies a certain condition; instead, they assume that $y_i$ equals the probability that a condition is satisfied given $x_i$. For instance, if we are to use Chaudhuri et al.'s method to predict whether a person has diabetes or not based on his/her age and body mass index, then we would need a dataset that gives us an accurate likelihood of diabetes for every possible (age, body mass index) combination. This requirement is rather impractical, as real datasets are often sparse (due to the curse of dimensionality or the existence of large-domain attributes) and the likelihoods are not measurable. Furthermore, Chaudhuri et al.'s method cannot be applied to datasets where $Y$ is a boolean attribute, since their method relies on a convexity property of the cost function. In standard logistic regression, the cost function $\log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega$ (or $\log(1 + \exp(-y_i x_i^T \omega))$ in [4, 5]) does not meet this assumption.

To ensure privacy protection, we require that the regression analysis be performed with an algorithm that satisfies $\epsilon$-differential privacy, which is defined based on the concept of neighbor databases, i.e., databases that have the same cardinality but differ in one (and only one) tuple.

Definition 3 ($\epsilon$-Differential Privacy [9]). A randomized algorithm $A$ satisfies $\epsilon$-differential privacy, iff for any output $O$ of $A$ and for any two neighbor databases $D_1$ and $D_2$, we have

$$\Pr[A(D_1) = O] \le e^{\epsilon} \cdot \Pr[A(D_2) = O].$$

By Definition 3, if an algorithm $A$ satisfies $\epsilon$-differential privacy for an $\epsilon$ close to 0, then the probability distribution of $A$'s output is roughly the same for any two input databases that differ in one tuple. This indicates that the output of $A$ does not reveal significant information about any particular tuple in the input, and hence, privacy is preserved.

As will be shown in Section 4, our solution is built upon the Laplace mechanism [9], which is a differentially private framework that can be used to answer any query $Q$ (on $D$) whose output is a vector of real numbers. In particular, the mechanism exploits the sensitivity of $Q$, which is defined as

$$S(Q) = \max_{D_1, D_2} \|Q(D_1) - Q(D_2)\|_1, \tag{1}$$

where $D_1$ and $D_2$ are any two neighbor databases, and $\|Q(D_1) - Q(D_2)\|_1$ is the $L_1$ distance between $Q(D_1)$ and $Q(D_2)$. Intuitively, $S(Q)$ captures the maximum change that could occur in the output of $Q$ when one tuple in the input data is replaced. Given $S(Q)$, the Laplace mechanism ensures $\epsilon$-differential privacy by injecting noise into each value in the output of $Q(D)$, such that the noise $\eta$ follows an i.i.d. Laplace distribution with zero mean and scale $S(Q)/\epsilon$ (see [9] for details):

$$\mathrm{pdf}(\eta) = \frac{\epsilon}{2 S(Q)} \exp\left(-|\eta| \cdot \frac{\epsilon}{S(Q)}\right).$$

In the rest of the paper, we use $\mathrm{Lap}(s)$ to denote a random variable drawn from a Laplace distribution with zero mean and scale $s$. For ease of reference, we summarize in Table 1 all notations that will be frequently used.

² There is a more general form of linear regression with an objective function $(\omega^*, \alpha^*) = \arg\min_{(\omega, \alpha)} \sum_{i=1}^{n} (y_i - x_i^T \omega - \alpha)^2$. We focus only on the type of linear regression in Definition 1 for ease of exposition, but our solution can be easily extended to the more general variant.
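To make the Laplace mechanism concrete, the following minimal sketch adds $\mathrm{Lap}(S(Q)/\epsilon)$ noise to each value of a query answer. It is an illustration under stated assumptions rather than code from the paper: the function names are hypothetical, and the Laplace sample is drawn as the difference of two exponential variables because Python's standard library has no direct Laplace sampler.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with zero mean and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(answer, sensitivity, epsilon, rng=random):
    """Add i.i.d. Lap(sensitivity / epsilon) noise to every output value."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    return [value + laplace_noise(scale, rng) for value in answer]

# A single count query: replacing one tuple changes the count by at most 1,
# so its sensitivity is 1 (an illustrative query, not one from the paper).
noisy_count = laplace_mechanism([120.0], sensitivity=1.0, epsilon=0.5)
```

A smaller $\epsilon$ gives a larger noise scale and hence stronger privacy at the cost of accuracy.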



4. FUNCTIONAL MECHANISM 

This section presents the Functional Mechanism (FM), a general framework for regression analysis under $\epsilon$-differential privacy. Section 4.1 introduces the details of the framework, while Section 4.2 illustrates how to apply FM to linear regression.

4.1 Perturbation of Objective Function 

Roughly speaking, FM is an extension of the Laplace mechanism that (i) does not inject noise directly into the regression results, but (ii) ensures privacy by perturbing the optimization goal of regression analysis. To explain, recall that a regression task on a database $D$ returns a model parameter $\omega^*$ that minimizes an objective function $f_D(\omega) = \sum_{t_i \in D} f(t_i, \omega)$. Direct publication of $\omega^*$ would violate $\epsilon$-differential privacy, since $\omega^*$ reveals information about $f_D(\omega)$ and $D$. One may attempt to address this issue by adding noise to $\omega^*$ using the Laplace mechanism; however, this solution requires an analysis of the sensitivity of $\omega^*$ (see Equation 1), which is rather challenging given the complex correlation between $D$ and $\omega^*$.

Instead of injecting noise directly into $\omega^*$, FM achieves $\epsilon$-differential privacy by (i) perturbing the objective function $f_D(\omega)$ and then (ii) releasing the model parameter $\bar{\omega}$ that minimizes the perturbed objective function $\bar{f}_D(\omega)$ (instead of the original one). A key issue here is: how can we perturb $f_D(\omega)$ in a differentially private manner, given that $f_D(\omega)$ can be a complicated function of $\omega$? We address this issue by exploiting the polynomial representation of $f_D(\omega)$, as will be shown in the following.

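Table 1: Table of notations

Notation                       Description
$D$                            database of $n$ records
$t_i = (x_i, y_i)$             the $i$-th tuple in $D$
$d$                            the number of values in the vector $x_i$
$\omega$                       the parameter vector of the regression model
$f(t_i, \omega)$               the cost function of the regression model that evaluates whether a model parameter $\omega$ leads to an accurate prediction for a tuple $t_i = (x_i, y_i)$
$f_D(\omega)$                  $f_D(\omega) = \sum_{t_i \in D} f(t_i, \omega)$
$\omega^*$                     $\omega^* = \arg\min_{\omega} f_D(\omega)$
$\bar{f}_D(\omega)$            a noisy version of $f_D(\omega)$
$\bar{\omega}$                 $\bar{\omega} = \arg\min_{\omega} \bar{f}_D(\omega)$
$\tilde{f}_D(\omega)$          the Taylor expansion of $f_D(\omega)$
$\tilde{\omega}$               $\tilde{\omega} = \arg\min_{\omega} \tilde{f}_D(\omega)$
$\hat{f}_D(\omega)$            a low order approximation of $\tilde{f}_D(\omega)$
$\hat{\omega}$                 $\hat{\omega} = \arg\min_{\omega} \hat{f}_D(\omega)$
$\phi(\omega)$                 a product of one or more values in $\omega$, e.g., $(\omega_1)^3 \cdot \omega_2$
$\Phi_j$                       the set of all possible $\phi$ of order $j$
$\lambda_{\phi t_i}$           the polynomial coefficient of $\phi$ in $f(t_i, \omega)$

Recall that $\omega$ is a vector that contains $d$ values $\omega_1, \ldots, \omega_d$. Let $\phi(\omega)$ denote a product of $\omega_1, \ldots, \omega_d$, namely, $\phi(\omega) = \omega_1^{c_1} \cdot \omega_2^{c_2} \cdots \omega_d^{c_d}$ for some $c_1, \ldots, c_d \in \mathbb{N}$. Let $\Phi_j$ ($j \in \mathbb{N}$) denote the set of all products of $\omega_1, \ldots, \omega_d$ with degree $j$, i.e.,

$$\Phi_j = \left\{ \omega_1^{c_1} \omega_2^{c_2} \cdots \omega_d^{c_d} \;\middle|\; \sum_{l=1}^{d} c_l = j \right\}. \tag{2}$$

For example, $\Phi_0 = \{1\}$, $\Phi_1 = \{\omega_1, \ldots, \omega_d\}$, and $\Phi_2 = \{\omega_i \cdot \omega_j \mid i, j \in [1, d]\}$. By the Stone-Weierstrass Theorem [26], any continuous and differentiable $f(t_i, \omega)$ can always be written as a (potentially infinite) polynomial of $\omega_1, \ldots, \omega_d$, i.e., for some $J \in [0, \infty]$, we have

$$f(t_i, \omega) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \lambda_{\phi t_i} \phi(\omega), \tag{3}$$

where $\lambda_{\phi t_i} \in \mathbb{R}$ denotes the coefficient of $\phi(\omega)$ in the polynomial. Similarly, $f_D(\omega)$ can also be expressed as a polynomial of $\omega_1, \ldots, \omega_d$.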
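The monomial sets $\Phi_j$ can be enumerated mechanically. The sketch below is an illustration, not part of the paper: it represents each monomial by its exponent tuple $(c_1, \ldots, c_d)$, and the function name `monomials` is a hypothetical choice.

```python
from itertools import combinations_with_replacement

def monomials(d, j):
    """Return Phi_j as exponent tuples (c_1, ..., c_d) whose entries sum to j."""
    result = []
    # Choosing j variable indices with repetition yields each monomial once.
    for picks in combinations_with_replacement(range(d), j):
        exponents = [0] * d
        for index in picks:
            exponents[index] += 1
        result.append(tuple(exponents))
    return result

phi_0 = monomials(2, 0)  # the constant monomial 1
phi_1 = monomials(2, 1)  # omega_1 and omega_2
phi_2 = monomials(2, 2)  # omega_1^2, omega_1 * omega_2, omega_2^2
```

Because $\omega_i \cdot \omega_j$ and $\omega_j \cdot \omega_i$ are the same monomial, this enumeration lists $d(d+1)/2$ elements for $\Phi_2$ rather than $d^2$ ordered pairs.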

Given the above polynomial representation of $f_D(\omega)$, we perturb $f_D(\omega)$ by injecting Laplace noise into its polynomial coefficients, and then derive the model parameter $\bar{\omega}$ that minimizes the perturbed function $\bar{f}_D(\omega)$, as shown in Algorithm 1. The correctness of Algorithm 1 is based on the following lemma and theorem.

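LEMMA 1. Let $D$ and $D'$ be any two neighbor databases. Let $f_D(\omega)$ and $f_{D'}(\omega)$ be the objective functions of regression analysis on $D$ and $D'$, respectively, and denote their polynomial representations as follows:

$$f_D(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \sum_{t_i \in D} \lambda_{\phi t_i} \phi(\omega),$$

$$f_{D'}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \sum_{t'_i \in D'} \lambda_{\phi t'_i} \phi(\omega).$$

Then, we have the following inequality:

$$\sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \sum_{t'_i \in D'} \lambda_{\phi t'_i} \right\|_1 \le 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t}\|_1,$$

where $t$ is an arbitrary tuple.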
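Algorithm 1 Functional Mechanism (Database $D$, objective function $f_D(\omega)$, privacy budget $\epsilon$)

1: Set $\Delta = 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t}\|_1$
2: for each $1 \le j \le J$ do
3:   for each $\phi \in \Phi_j$ do
4:     set $\lambda_{\phi} = \sum_{t_i \in D} \lambda_{\phi t_i} + \mathrm{Lap}(\Delta/\epsilon)$
5:   end for
6: end for
7: Let $\bar{f}_D(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \lambda_{\phi} \phi(\omega)$
8: Compute $\bar{\omega} = \arg\min_{\omega} \bar{f}_D(\omega)$
9: Return $\bar{\omega}$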

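PROOF. Without loss of generality, assume that $D$ and $D'$ differ in the last tuple. Let $t_n$ ($t'_n$ resp.) be the last tuple in $D$ ($D'$ resp.). Then,

$$\sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \sum_{t'_i \in D'} \lambda_{\phi t'_i} \right\|_1 = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t_n} - \lambda_{\phi t'_n}\|_1$$

$$\le \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t_n}\|_1 + \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t'_n}\|_1$$

$$\le 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t}\|_1.$$

□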

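THEOREM 1. Algorithm 1 satisfies $\epsilon$-differential privacy.

PROOF. Let $D$ and $D'$ be two neighbor databases. Without loss of generality, assume that $D$ and $D'$ differ in the last tuple. Let $t_n$ ($t'_n$) be the last tuple in $D$ ($D'$). Let $\Delta$ be calculated as on Line 1 of Algorithm 1, and let $\bar{f}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \lambda_{\phi} \phi(\omega)$ be the output of Line 7 of the algorithm. We have

$$\frac{\Pr\{\bar{f}(\omega) \mid D\}}{\Pr\{\bar{f}(\omega) \mid D'\}} = \frac{\prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \exp\left( -\frac{\epsilon}{\Delta} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \lambda_{\phi} \right\|_1 \right)}{\prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \exp\left( -\frac{\epsilon}{\Delta} \left\| \sum_{t'_i \in D'} \lambda_{\phi t'_i} - \lambda_{\phi} \right\|_1 \right)}$$

$$\le \prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \exp\left( \frac{\epsilon}{\Delta} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \sum_{t'_i \in D'} \lambda_{\phi t'_i} \right\|_1 \right)$$

$$= \prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \exp\left( \frac{\epsilon}{\Delta} \|\lambda_{\phi t_n} - \lambda_{\phi t'_n}\|_1 \right)$$

$$= \exp\left( \frac{\epsilon}{\Delta} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t_n} - \lambda_{\phi t'_n}\|_1 \right)$$

$$\le \exp\left( \frac{\epsilon}{\Delta} \cdot 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t}\|_1 \right) \quad \text{(by Lemma 1)}$$

$$= \exp(\epsilon).$$

In other words, the computation of $\bar{f}(\omega)$ ensures $\epsilon$-differential privacy. The final result of Algorithm 1 is derived from $\bar{f}(\omega)$ without using any additional information from the original database. Therefore, Algorithm 1 is $\epsilon$-differentially private. □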

One potential issue with Algorithm 1 is that the optimization of the noisy objective function $\bar{f}_D(\omega)$ (see Line 8) can be unbounded when the amount of noise inserted is sufficiently large, leading to meaningless regression results. We address this issue in Section 6.

In the following, we provide a convergence analysis of Algorithm 1, showing that its output $\bar{\omega}$ is arbitrarily close to the actual minimizer of $f_D(\omega)$ when the database cardinality $n$ is sufficiently large. Our analysis focuses on the averaged objective function $\frac{1}{n} f_D(\omega)$ instead of $f_D(\omega)$, since the latter monotonically increases with $n$. Assume we have a series of databases, $\{D_1, D_2, \ldots, D_n, \ldots\}$, where each $D_j$ contains $j$ tuples all drawn from a fixed but unknown distribution with probability distribution function $p(t)$. We have the following lemma.

LEMMA 2. If $\lambda_{\phi t}$ is a bounded real number in $(-\infty, +\infty)$ for any $t$ and $\phi \in \cup_{j=1}^{J} \Phi_j$, there exists a polynomial $g(\omega)$ with constant coefficients such that $\lim_{n \to \infty} \frac{1}{n} f_{D_n}(\omega) = g(\omega)$.

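PROOF. Based on our polynomial representation scheme, $\frac{1}{n} f_{D_n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left( \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i} \right) \phi(\omega)$, where each $t_i$ is an i.i.d. sample from $p(t)$. When the database cardinality $n$ approaches $+\infty$, we rewrite $\frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i}$ as follows:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i} = \int_{t} \lambda_{\phi t} \, p(t) \, dt = E(\lambda_{\phi t}) = c_{\phi}. \tag{4}$$

By the assumption that $\lambda_{\phi t}$ is bounded, $c_{\phi} = E(\lambda_{\phi t})$ is a constant that always exists for any $\phi \in \cup_{j=1}^{J} \Phi_j$. Thus, $\lim_{n \to \infty} \frac{1}{n} f_{D_n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} c_{\phi} \phi(\omega)$. This completes the proof, by letting $g(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} c_{\phi} \phi(\omega)$. □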

THEOREM 2. When the database cardinality $n \to +\infty$, the output $\bar{\omega}$ of Algorithm 1 satisfies $g(\bar{\omega}) = \min_{\omega} g(\omega)$, if $\lambda_{\phi t}$ is bounded for any $t$ and $\phi \in \cup_{j=1}^{J} \Phi_j$.

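PROOF. To prove this theorem, we first show that $\lim_{n \to \infty} \frac{1}{n} \bar{f}_{D_n}(\omega) = g(\omega)$ for any $\omega$.

Given Algorithm 1 with input dataset $D_n$, objective function $f_{D_n}(\omega)$ and privacy budget $\epsilon$, the averaged perturbed objective function is $\frac{1}{n} \bar{f}_{D_n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \frac{\lambda_{\phi}}{n} \phi(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \frac{1}{n} \left( \sum_{i=1}^{n} \lambda_{\phi t_i} + \mathrm{Lap}(\Delta/\epsilon) \right) \phi(\omega)$. When $n \to +\infty$, we have

$$\lim_{n \to \infty} \frac{\lambda_{\phi}}{n} = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i} + \lim_{n \to \infty} \frac{1}{n} \mathrm{Lap}\left(\frac{\Delta}{\epsilon}\right) = c_{\phi} + \lim_{n \to \infty} \frac{1}{n} \mathrm{Lap}\left(\frac{\Delta}{\epsilon}\right).$$

When $\Delta$ and $\epsilon > 0$ are both finite real numbers, it follows that $\lim_{n \to \infty} \frac{1}{n} \mathrm{Lap}\left(\frac{\Delta}{\epsilon}\right) = 0$. This leads to

$$\lim_{n \to \infty} \frac{1}{n} \bar{f}_{D_n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} c_{\phi} \phi(\omega) = g(\omega). \tag{5}$$

Since Equation 5 is applicable to any $\omega$, we have $g(\bar{\omega}) = \min_{\omega} g(\omega)$ by proving $\lim_{n \to \infty} \frac{1}{n} \bar{f}_{D_n}(\bar{\omega}) = \min_{\omega} \lim_{n \to \infty} \frac{1}{n} \bar{f}_{D_n}(\omega)$, which is obvious given the definition of $\bar{\omega}$ in Algorithm 1. □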

Combining the results of Lemma 2 and Theorem 2, we conclude that the output of Algorithm 1 approaches the minimizer of $f_{D_n}(\omega)$ when the database cardinality $n \to +\infty$.
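The rate at which the averaged noise term vanishes can be quantified with a short calculation. The sketch below is an illustration added here, not a result stated in the paper: it relies on the standard fact that a Laplace variable with scale $s$ has variance $2s^2$, so the term $\frac{1}{n}\mathrm{Lap}(\Delta/\epsilon)$ has standard deviation $\sqrt{2}\,\Delta/(\epsilon n)$. The function name is hypothetical.

```python
import math

def averaged_noise_std(delta, epsilon, n):
    """Standard deviation of Lap(delta / epsilon) / n.

    A Laplace variable with scale s has variance 2 * s**2, so dividing the
    sample by n divides its standard deviation by n as well.
    """
    return math.sqrt(2.0) * delta / (epsilon * n)

# With delta and epsilon fixed, the averaged noise shrinks linearly in 1/n.
for n in (10, 1000, 100000):
    print(n, averaged_noise_std(delta=8.0, epsilon=1.0, n=n))
```

Since $\Delta$ does not depend on $n$, multiplying the cardinality by ten divides the averaged noise by ten.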
4.2 Application to Linear Regression

After presenting the general framework, we next apply FM to linear regression as defined in Definition 1. In linear regression, recall that $t_i = (x_i, y_i)$ is the $i$-th tuple in database $D$ with $\sqrt{\sum_{j=1}^{d} x_{ij}^2} \le 1$ and $y_i \in [-1, 1]$; $\omega$ is a $d$-dimensional vector that contains the model parameters. The expansion of the objective function $f_D(\omega)$ for linear regression is

$$f_D(\omega) = \sum_{t_i \in D} (y_i - x_i^T \omega)^2 = \sum_{t_i \in D} (y_i)^2 - \sum_{j=1}^{d} \left( 2 \sum_{t_i \in D} y_i x_{ij} \right) \omega_j + \sum_{1 \le j, l \le d} \left( \sum_{t_i \in D} x_{ij} x_{il} \right) \omega_j \omega_l.$$

Therefore, $f_D(\omega)$ only involves monomials in $\Phi_0$, $\Phi_1$ and $\Phi_2$. Since each $x_i$ lies in the $d$-dimensional unit sphere and $y_i \in [-1, 1]$, given the objective function of linear regression, Line 1 of Algorithm 1 can calculate the parameter $\Delta$ as

$$\Delta = 2 \max_{t = (x, y)} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \|\lambda_{\phi t}\|_1 \le 2 \max_{t = (x, y)} \left( y^2 + 2 \sum_{j=1}^{d} y \, x_{(j)} + \sum_{1 \le j, l \le d} x_{(j)} x_{(l)} \right) \le 2(1 + 2d + d^2),$$

where $t$ is an arbitrary tuple and $x_{(j)}$ denotes the $j$-th dimension of vector $x$. Thus, Algorithm 1 adds $\mathrm{Lap}(2(d+1)^2/\epsilon)$ noise to each coefficient, and the optimization on $\omega$ is run on the noisy objective function.
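For linear regression, the perturbation step can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes inputs already normalized so that each $x_i$ has norm at most 1 and each $y_i \in [-1, 1]$, it perturbs the degree-1 and degree-2 coefficients (the constant term does not affect the minimizer and is omitted), it treats the $d^2$ ordered pairs $(j, l)$ as separate coefficients as in the bound on $\Delta$, and all function names are hypothetical.

```python
import random

def laplace_noise(scale, rng=random):
    """Zero-mean Laplace sample, drawn as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def tuple_l1_norm(x, y):
    """L1 norm of the coefficients of (y - x^T w)^2 expanded as a polynomial in w."""
    linear = sum(abs(2.0 * y * xj) for xj in x)
    quadratic = sum(abs(xj * xl) for xj in x for xl in x)
    return y * y + linear + quadratic

def noisy_linear_objective(xs, ys, epsilon, rng=random):
    """Return noisy degree-1 and degree-2 coefficients of the objective.

    Every coefficient receives independent Lap(2 * (d + 1)**2 / epsilon) noise.
    """
    d = len(xs[0])
    scale = 2.0 * (d + 1) ** 2 / epsilon
    linear = [
        sum(-2.0 * y * x[j] for x, y in zip(xs, ys)) + laplace_noise(scale, rng)
        for j in range(d)
    ]
    quadratic = [
        [sum(x[j] * x[l] for x in xs) + laplace_noise(scale, rng) for l in range(d)]
        for j in range(d)
    ]
    return linear, quadratic
```

For any normalized tuple, `tuple_l1_norm` stays at or below $(d+1)^2$, which is why the fixed value $\Delta = 2(d+1)^2$ can be used without inspecting the data. The noisy quadratic coefficients need not form a symmetric or positive definite matrix, which is the issue deferred to Section 6.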

For example, assume that we have a two-dimensional database $D$ with three tuples: $(x_1, y_1) = (1, 0.4)$, $(x_2, y_2) = (0.9, 0.3)$, and $(x_3, y_3) = (-0.5, -1)$. The objective function for linear regression is $f_D(\omega) = 2.06\omega^2 - 2.34\omega + 1.25$, with optimal $\omega^* = 117/206 \approx 0.568$. If we apply Algorithm 1 on $D$, then Line 1 of Algorithm 1 would set $\Delta = 2(d+1)^2 = 8$, and then generate the noisy objective function $\bar{f}_D(\omega)$. Figure 2 shows an example of $f_D$ and $\bar{f}_D$. Notice that the global optimum of $\bar{f}_D(\omega)$ is close to the original $\omega^*$ when the coefficients are approximately preserved.
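The numbers in this example can be checked with exact arithmetic. The following sketch recomputes the three coefficients and the minimizer from the three tuples, and then shows one way to minimize a noisy version. It is an illustration only: the variable and function names are assumptions, and returning `None` for a non-positive quadratic coefficient is simply this sketch's way of flagging the unbounded case mentioned in Section 4.1.

```python
import random
from fractions import Fraction

xs = [Fraction(1), Fraction(9, 10), Fraction(-1, 2)]
ys = [Fraction(2, 5), Fraction(3, 10), Fraction(-1)]

# f_D(w) = a * w^2 + b * w + c for the single parameter w (d = 1).
a = sum(x * x for x in xs)
b = sum(-2 * x * y for x, y in zip(xs, ys))
c = sum(y * y for y in ys)
omega_star = -b / (2 * a)

def laplace_noise(scale, rng=random):
    """Zero-mean Laplace sample, drawn as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_minimizer(a, b, epsilon, rng=random):
    """Perturb a and b with Lap(8 / epsilon) noise and minimize the result.

    Returns None when the noisy quadratic coefficient is not positive,
    in which case the noisy objective has no finite minimizer.
    """
    scale = 8.0 / epsilon
    a_bar = float(a) + laplace_noise(scale, rng)
    b_bar = float(b) + laplace_noise(scale, rng)
    if a_bar <= 0:
        return None
    return -b_bar / (2.0 * a_bar)
```

With only three tuples, noise of scale $8/\epsilon$ is large relative to the coefficients, so the unbounded case will occur frequently for small $\epsilon$; the convergence analysis above explains why this improves as $n$ grows.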

The analysis for linear regression is fairly simple, because the objective function is itself a polynomial in $\omega$. For other regression tasks (e.g., logistic regression), Algorithm 1 cannot be directly applied, as the objective function may not be a polynomial of finite order. In the next section, we present a solution to this problem.

5. POLYNOMIAL APPROXIMATION OF 
OBJECTIVE FUNCTIONS 

For Algorithm 1 to work, it is crucial that the polynomial form of the objective function $f_D(\omega)$ contains only terms with bounded degrees. While this condition holds for certain types of regression analysis (e.g., linear regression, as we have shown in Section 4), there exist regression tasks where the condition cannot be satisfied (e.g., logistic regression). To address this issue, this section presents a method for deriving an approximate polynomial form of $f_D(\omega)$ based on Taylor expansions. For ease of exposition, we will focus on logistic regression, but our method can be adopted for other types of regression tasks as well.




Figure 2: Example of objective function for linear regression 
and its noisy version obtained by FM 



5.1 Expansion 

Consider the cost function $f(t_i, \omega)$ of regression analysis. Assume that there exist $2m$ functions $f_1, \ldots, f_m$ and $g_1, \ldots, g_m$, such that (i) $f(t_i, \omega) = \sum_{l=1}^{m} f_l(g_l(t_i, \omega))$, and (ii) each $g_l$ is a polynomial function of $\omega_1, \ldots, \omega_d$. (As will be shown shortly, such a decomposition of $f(t_i, \omega)$ is useful for handling logistic regression.) Given the above decomposition of $f(t_i, \omega)$, we can apply the Taylor expansion on each $f_l(\cdot)$ to obtain the following equation:

$$\tilde{f}(t_i, \omega) = \sum_{l=1}^{m} \sum_{k=0}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k, \tag{6}$$

where each $z_l$ is a real number. Accordingly, the objective function $\tilde{f}_D(\omega)$ can be written as:

$$\tilde{f}_D(\omega) = \sum_{i=1}^{n} \sum_{l=1}^{m} \sum_{k=0}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k. \tag{7}$$

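To explain how Equations 6 and 7 are related to logistic regression,
recall that the cost function of logistic regression is
$f(t_i, \omega) = \log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega$.
Let $f_1$, $f_2$, $g_1$, and $g_2$ be four functions defined as
follows:

$$g_1(t_i, \omega) = x_i^T \omega, \qquad g_2(t_i, \omega) = y_i x_i^T \omega,$$

$$f_1(z) = \log(1 + \exp(z)), \qquad f_2(z) = -z.$$

Then, we have $f(t_i, \omega) = f_1(g_1(t_i, \omega)) +
f_2(g_2(t_i, \omega))$. By Equations 6 and 7,

$$f_D(\omega) = \sum_{i=1}^{n} \sum_{l=1}^{2} \sum_{k=0}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left(g_l(t_i, \omega) - z_l\right)^k. \tag{8}$$

Since $f_2(z) = -z$, we have $f_2^{(k)} = 0$ for any $k > 1$. Given
this fact and by setting $z_l = 0$, Equation 8 can be simplified as

$$f_D(\omega) = \sum_{i=1}^{n} \sum_{k=0}^{\infty} \frac{f_1^{(k)}(0)}{k!} \left(x_i^T \omega\right)^k - \left(\sum_{i=1}^{n} y_i x_i^T\right) \omega. \tag{9}$$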

There are two complications in Equation 9 that prevent us from
applying it for private logistic regression. First, the equation
involves an infinite summation. Second, the term $f_1^{(k)}(0)$
involved in the equation does not have a closed-form expression. To
address these two issues, we will present an approximate approach that
reduces the degree of the summation, and the approach only requires
the value of $f_1^{(k)}(0)$ for $k = 0, 1, 2$, i.e.,
$f_1^{(0)}(0) = \log 2$, $f_1^{(1)}(0) = \frac{1}{2}$, and
$f_1^{(2)}(0) = \frac{1}{4}$.
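The three derivative values quoted above are easy to sanity-check numerically. The sketch below is an illustration only (the function names are hypothetical): it estimates $f_1'(0)$ and $f_1''(0)$ by central differences and builds the second-order Taylor polynomial of $f_1(z) = \log(1 + \exp(z))$ around $z = 0$.

```python
import math


def f1(z):
    """f_1(z) = log(1 + exp(z))."""
    return math.log1p(math.exp(z))


# Values of f_1, f_1' and f_1'' at z = 0, as stated in the text.
COEFFS = (math.log(2.0), 0.5, 0.25)


def f1_quadratic(z):
    """Second-order Taylor polynomial of f_1 around z = 0."""
    c0, c1, c2 = COEFFS
    return c0 + c1 * z + (c2 / 2.0) * z * z


def central_differences(f, h=1e-4):
    """Finite-difference estimates of f'(0) and f''(0)."""
    first = (f(h) - f(-h)) / (2.0 * h)
    second = (f(h) - 2.0 * f(0.0) + f(-h)) / (h * h)
    return first, second
```

Comparing `f1(z)` with `f1_quadratic(z)` over a grid of `z` values in $[-1, 1]$ shows how closely the quadratic tracks the original function on that range.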
5.2 Approximation

Our approximation approach works by truncating the Taylor series in
Equation 9 to remove all polynomial terms with order larger than 2.
This leads to a new objective function with only low order
polynomials as follows:

$$\hat{f}_D(\omega) = \sum_{l=1}^{m} \sum_{i=1}^{n} \sum_{k=0}^{2} \frac{f_l^{(k)}(z_l)}{k!} \left(g_l(t_i, \omega) - z_l\right)^k. \tag{10}$$

A natural question is: how much error would the above approximation
approach incur? The following lemmata provide the answer.



Figure 3: Example of objective function for logistic regression
and its polynomial approximation (curves: $f_D(\omega)$ and
$\hat{f}_D(\omega)$, plotted against $\omega$)



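LEMMA 3. Let $\tilde{\omega} = \arg\min_{\omega} f_D(\omega)$ and
$\hat{\omega} = \arg\min_{\omega} \hat{f}_D(\omega)$. Let
$L = \max_{\omega} \left(f_D(\omega) - \hat{f}_D(\omega)\right)$ and
$S = \min_{\omega} \left(f_D(\omega) - \hat{f}_D(\omega)\right)$. We
have the following inequality:

$$f_D(\hat{\omega}) - f_D(\tilde{\omega}) \le L - S. \tag{11}$$

PROOF. Observe that $L \ge f_D(\hat{\omega}) - \hat{f}_D(\hat{\omega})$
and $S \le f_D(\tilde{\omega}) - \hat{f}_D(\tilde{\omega})$. Therefore,

$$f_D(\hat{\omega}) - \hat{f}_D(\hat{\omega}) - f_D(\tilde{\omega}) + \hat{f}_D(\tilde{\omega}) \le L - S.$$

In addition, $\hat{f}_D(\hat{\omega}) - \hat{f}_D(\tilde{\omega}) \le 0$.
Adding the two inequalities shows that Equation 11 holds. □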
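Lemma 3 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses the three one-dimensional tuples that this section employs for Figure 3, takes the degree-2 truncation of $\log(1 + \exp(z))$ around $0$, and restricts $\omega$ to a finite grid over $[-1, 1]$ (the proof of Lemma 3 goes through unchanged when all optima are taken over such a grid). The names `f_exact`, `f_trunc` and `lemma3_gap` are hypothetical.

```python
import math

# Three tuples (x_i, y_i) with a single feature, as in the Figure 3 example.
DATA = [(-0.5, 1.0), (0.0, 0.0), (1.0, 1.0)]


def f_exact(w):
    """Logistic objective f_D(w) = sum_i log(1 + exp(x_i w)) - y_i x_i w."""
    return sum(math.log1p(math.exp(x * w)) - y * x * w for x, y in DATA)


def f_trunc(w):
    """Truncated objective: log 2 + z/2 + z^2/8 replaces log(1 + exp(z))."""
    return sum(math.log(2.0) + 0.5 * x * w + 0.125 * (x * w) ** 2 - y * x * w
               for x, y in DATA)


def lemma3_gap(grid):
    """Return (f_D(w_hat) - f_D(w_tilde), L - S), all optima taken over grid."""
    w_tilde = min(grid, key=f_exact)   # minimizer of the exact objective
    w_hat = min(grid, key=f_trunc)     # minimizer of the truncated objective
    diffs = [f_exact(w) - f_trunc(w) for w in grid]
    return f_exact(w_hat) - f_exact(w_tilde), max(diffs) - min(diffs)


grid = [i / 1000.0 for i in range(-1000, 1001)]
loss, bound = lemma3_gap(grid)
```

Here `loss` is the left-hand side of Equation 11 and `bound` is $L - S$; the lemma guarantees `loss <= bound` for any choice of grid.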

Lemma 3 shows that the error incurred by truncating the Taylor
series depends on the maximum and minimum values of
$f_D(\omega) - \hat{f}_D(\omega)$. To quantify the magnitude of the
error, we first rewrite $f_D(\omega) - \hat{f}_D(\omega)$ in a form
similar to Equation 8:

$$f_D(\omega) - \hat{f}_D(\omega) = \sum_{l=1}^{m} \sum_{i=1}^{n} \sum_{k=3}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left(g_l(t_i, \omega) - z_l\right)^k.$$

To derive the minimum and maximum values of the function above, we
look into the remainder of the Taylor expansion. The following lemma
provides exact lower and upper bounds on
$f_D(\omega) - \hat{f}_D(\omega)$, which is a well known result [1].

LEMMA 4. For any $z \in [z_l - 1, z_l + 1]$,
$\frac{1}{n}\left(f_D(\omega) - \hat{f}_D(\omega)\right)$ must be in
the interval

$$\left[\ \sum_{l} \frac{\min_z f_l^{(3)}(z)\,(z - z_l)^3}{6},\ \ \sum_{l} \frac{\max_z f_l^{(3)}(z)\,(z - z_l)^3}{6}\ \right].$$

By combining Lemmata 3 and 4, we can easily calculate the error
incurred by our approximation approach. In particular, the error
only depends on the structure of the function, and is independent
of the characteristics of the dataset. Furthermore, the average error
of the approximation is always bounded, since

$$\frac{1}{n}\left(f_D(\hat{\omega}) - f_D(\tilde{\omega})\right) \le \sum_{l} \frac{\max_z f_l^{(3)}(z)\,(z - z_l)^3 - \min_z f_l^{(3)}(z)\,(z - z_l)^3}{6}.$$

The above analysis applies to the case of logistic regression as
follows. First, for the function $f_1(z) = \log(1 + \exp(z))$, we have
$f_1^{(3)}(z) = \frac{\exp(z) - \exp(2z)}{(1 + \exp(z))^3}$. It can be
verified that, for $z \in [-1, 1]$,
$\min_z f_1^{(3)}(z) = \frac{e - e^2}{(1 + e)^3}$ and
$\max_z f_1^{(3)}(z) = \frac{e^2 - e}{(1 + e)^3}$. With $z_1 = 0$, the
factors $f_1^{(3)}(z)$ and $z^3$ always have opposite signs, so the
product $f_1^{(3)}(z)\,z^3$ ranges over
$\left[\frac{e - e^2}{(1 + e)^3},\, 0\right]$, while $f_2^{(3)} = 0$.
Thus, the average error of the approximation is at most

$$\frac{1}{n}\left(f_D(\hat{\omega}) - f_D(\tilde{\omega})\right) \le \frac{0 - (e - e^2)}{6(1 + e)^3} = \frac{e^2 - e}{6(1 + e)^3} \approx 0.015.$$

In other words, the error of the approximation on logistic regression
is a small constant. However, because of this error, there does
not exist a convergence result similar to the one stated in Theorem
2. That is, there is a gap between the results from our approximation
approach and those from a standard regression algorithm.
To illustrate this, let us consider a two-dimensional database D
with three tuples, $(x_1, y_1) = (-0.5, 1)$, $(x_2, y_2) = (0, 0)$, and
$(x_3, y_3) = (1, 1)$. Figure 3 illustrates the objective function of
logistic regression $f_D$ as well as its approximation $\hat{f}_D$. As
will be shown in our experiments, however, our approximation approach
still leads to accurate regression results.

5.3 Application to Logistic Regression

Algorithm 2 presents an extension of Algorithm 1 that incorporates
our polynomial approximation approach. In particular, given
a dataset D, Algorithm 2 first constructs a new objective function
$\hat{f}_D(\omega)$ that approximates the original one, and then feeds
the new objective function as input to Algorithm 1. The model
parameter returned from Algorithm 1 is then output as the final result
of Algorithm 2. It can be verified that Algorithm 2 guarantees
$\epsilon$-differential privacy. This follows from the fact that
(i) Algorithm 1 guarantees $\epsilon$-differential privacy for any
given objective function (regardless of whether it is an
approximation), and (ii) the output of Algorithm 2 is directly
obtained from Algorithm 1.

To apply Algorithm 2 for logistic regression, we set

$$\hat{f}_D(\omega) = \sum_{i=1}^{n} \sum_{k=0}^{2} \frac{f_1^{(k)}(0)}{k!} \left(x_i^T \omega\right)^k - \left(\sum_{i=1}^{n} y_i x_i^T\right) \omega.$$

(This is by Equation 8 and the fact that $\hat{f}_D(\omega)$ retains
only the low order terms in $f_D(\omega)$.) After that, when
$\hat{f}_D(\omega)$ is fed as part of the input to Algorithm 1, Line 1
of Algorithm 1 would calculate the parameter $\Delta$ as:

$$\Delta = 2 \max_{t = (x, y)} \left( \frac{f_1^{(1)}(0)}{1!} \sum_{j=1}^{d} x_{(j)} + \frac{f_1^{(2)}(0)}{2!} \sum_{j,l} x_{(j)} x_{(l)} + y \sum_{j=1}^{d} x_{(j)} \right) \le 2\left(\frac{d}{2} + \frac{d^2}{8} + d\right) = \frac{d^2}{4} + 3d,$$

where $t$ is an arbitrary tuple and $x_{(j)}$ denotes the $j$-th
dimension of the vector $x$.

Recall that Algorithm 1 injects Laplace noise with scale
$\Delta/\epsilon$ to the coefficients of the objective function (see
Line 4 of Algorithm 1). Therefore, $\Delta = d^2/4 + 3d$ indicates
that the amount of noise injected by our algorithm is only related to
$d$ and is independent of the dataset cardinality.
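To make the procedure concrete, the following sketch computes the coefficients of the truncated logistic objective and perturbs them with Laplace noise of scale $\Delta/\epsilon$, where $\Delta = d^2/4 + 3d$. It is a simplified illustration, not the authors' code: the function names are hypothetical, the quadratic part is stored as a symmetric matrix whose upper triangle is perturbed and mirrored, and the constant term is left unperturbed because it does not affect the minimizer.

```python
import math
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def truncated_logistic_coefficients(rows, labels):
    """Return (beta, alpha, M) with f_hat_D(w) = w^T M w + alpha . w + beta."""
    d = len(rows[0])
    beta = len(rows) * math.log(2.0)                      # sum_i f_1(0)
    alpha = [sum((0.5 - y) * x[j] for x, y in zip(rows, labels))
             for j in range(d)]                           # f_1'(0) = 1/2, minus y_i x_i
    m = [[sum(x[j] * x[l] for x in rows) / 8.0 for l in range(d)]
         for j in range(d)]                               # f_1''(0) / 2! = 1/8
    return beta, alpha, m


def sensitivity(d):
    """Delta = d^2 / 4 + 3d, as derived in the text."""
    return d * d / 4.0 + 3.0 * d


def perturb(alpha, m, epsilon, rng):
    """Add Laplace(Delta / epsilon) noise to the linear and quadratic coefficients."""
    d = len(alpha)
    scale = sensitivity(d) / epsilon
    noisy_alpha = [a + laplace(scale, rng) for a in alpha]
    noisy_m = [row[:] for row in m]
    for j in range(d):
        for l in range(j, d):
            noisy_m[j][l] = m[j][l] + laplace(scale, rng)
            noisy_m[l][j] = noisy_m[j][l]                 # keep the matrix symmetric
    return noisy_alpha, noisy_m
```

Note that `sensitivity` depends only on `d`, which mirrors the observation above that the noise scale is independent of the number of tuples.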

6. AVOIDING UNBOUNDED NOISY
OBJECTIVE FUNCTIONS

As shown in the previous sections, FM achieves $\epsilon$-differential
privacy by injecting Laplace noise into the coefficients of the
objective functions of optimization problems. The injection of noise,
however, may render the objective function unbounded, i.e., there
may not exist any optimal solution for the noisy objective function.
For instance, if we fit a linear regression model on a two-dimensional
dataset, the objective function would be a quadratic function
$f_D(\omega) = a\omega^2 + b\omega + c$ with a minimum point (see
Figure 2 for an example). If we add noise into the coefficients of
$f_D(\omega)$, however, the resulting objective function may no longer
have a minimum, namely when coefficient $a$ becomes non-positive after
noise injection. In that case, there does not exist a solution to the
optimization problem.
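As a rough worked illustration of how likely this failure is in the one-variable case, suppose (as Algorithm 1 does) that the noise $\eta$ added to $a$ is Laplace with scale $\Delta/\epsilon$, and that $a > 0$ before perturbation. The Laplace distribution function then gives the probability of a non-positive leading coefficient directly. The numeric values of $a$ below are assumed purely for illustration.

```latex
\[
\Pr[a + \eta \le 0] \;=\; \Pr[\eta \le -a]
  \;=\; \tfrac{1}{2}\exp\!\left(-\frac{a\,\epsilon}{\Delta}\right).
\]
% Example with \Delta = 8 and \epsilon = 0.8 (noise scale 10):
%   a = 10   gives  (1/2) e^{-1}   \approx 0.18,
%   a = 100  gives  (1/2) e^{-10}  \approx 2.3 \times 10^{-5}.
```

The probability shrinks exponentially as $a$ grows relative to the noise scale, which is why the problem matters mainly for small datasets or small privacy budgets.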

One simple approach to address the above issue is to re-run FM
whenever the noisy objective function is unbounded, until we obtain
a solution to the optimization problem. This approach, as
shown in the following lemma, still ensures differential privacy, but
incurs two times the privacy cost of FM.

LEMMA 5. Let $A^*$ be an algorithm that repeats Algorithm 1
with privacy budget $\epsilon$ on a dataset, until the output of
Algorithm 1 corresponds to a bounded objective function. Then, $A^*$
satisfies $(2\epsilon)$-differential privacy.

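PROOF. Let $D$ and $D'$ be any two neighbor datasets, $A$ be
Algorithm 1, and $O$ be any output of $A$. Since $A$ ensures
$\epsilon$-differential privacy (see Theorem 1), we have

$$e^{-\epsilon} \cdot \Pr[A(D') = O] \le \Pr[A(D) = O] \le e^{\epsilon} \cdot \Pr[A(D') = O]. \tag{12}$$

Let $\mathcal{O}^+$ be the set of outputs by Algorithm 1 that
correspond to bounded objective functions. For any
$O^+ \in \mathcal{O}^+$, we have

$$\begin{aligned}
\Pr[A^*(D) = O^+] &= \frac{\Pr[A(D) = O^+]}{\sum_{O' \in \mathcal{O}^+} \Pr[A(D) = O']} \\
&\le \frac{e^{\epsilon} \cdot \Pr[A(D') = O^+]}{e^{-\epsilon} \cdot \sum_{O' \in \mathcal{O}^+} \Pr[A(D') = O']} \qquad \text{(by Eqn. 12)} \\
&= e^{2\epsilon} \cdot \Pr[A^*(D') = O^+].
\end{aligned}$$

□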



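Algorithm 2 Functional Mechanism (Database D, objective
function $f_D(\omega)$, privacy budget $\epsilon$)
1: Decompose the function $f(t_i, \omega) = \sum_{l=1}^{m} f_l(g_l(t_i, \omega))$.
2: Build a new objective function $\hat{f}_D(\omega)$, such that
   $\hat{f}_D(\omega) = \sum_{l=1}^{m} \sum_{i=1}^{n} \sum_{k=0}^{2} \frac{f_l^{(k)}(z_l)}{k!} \left(g_l(t_i, \omega) - z_l\right)^k$.
3: Run Algorithm 1 with input $(D, \hat{f}_D(\omega), \epsilon)$.
4: Return $\bar{\omega}$ from Algorithm 1.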



Although repeating FM provides a quick fix to obtain bounded 
objective functions, it leads to sub-optimal results as it entails a 
considerably higher privacy cost than FM does. To address this 
issue, we propose two methods to avoid unbounded objective func- 
tions in linear and logistic regressions, as will be detailed in Sec- 
tions 6.1 and 6.2. 

6.1 Regularization 

As shown in Sections 4 and 5, given a linear or logistic regression
task, FM would transform the objective function into a quadratic
polynomial $f_D(\omega)$, after which it injects noise into the
coefficients of $f_D(\omega)$ to ensure privacy. Let
$f_D(\omega) = \omega^T M \omega + \alpha\omega + \beta$ be the
matrix representation of the quadratic polynomial, and
$\bar{f}_D(\omega) = \omega^T M^* \omega + \alpha^*\omega + \beta^*$
be the noisy version of $f_D(\omega)$ after injection of Laplace
noise. Then, $M$ must be symmetric and positive definite [28]. To
ensure that $\bar{f}_D(\omega)$ is bounded after noise injection, it
suffices to make $M^*$ also symmetric and positive definite [28].

The symmetry of $M^*$ can be easily achieved by (i) adding noise
to the upper triangular part of the matrix and (ii) copying each
entry to its counterpart in the lower triangular part. In contrast, it
is rather challenging to ensure that $M^*$ is positive definite. To
our knowledge, there is no existing method for transforming a positive
definite matrix into another positive definite matrix in a
differentially private manner. To circumvent this, we adopt a
heuristic approach called regularization from the literature of
regression analysis [14,29]. In particular, we add a positive constant
$\lambda$ to each entry in the main diagonal of $M^*$, such that the
noisy objective function becomes

$$\bar{f}_D(\omega) = \omega^T (M^* + \lambda I)\,\omega + \alpha^*\omega + \beta^*, \tag{13}$$

where $I$ is a $d \times d$ identity matrix, and $\alpha^*$ and
$\beta^*$ are the noisy versions of $\alpha$ and $\beta$,
respectively.

Although regularization is mostly used in regression analysis to
avoid overfitting [14,29], it also helps achieve a bounded
$\bar{f}_D(\omega)$. To illustrate this, consider that we perform
linear regression on a two-dimensional database. We have $d = 1$. In
addition, each of $\omega$, $M^* + \lambda I$, $\alpha^*$, and
$\beta^*$ contains only one value (see Figure 2 for an example).
Accordingly, the noisy objective function $\bar{f}_D(\omega)$ would be
a quadratic function with one variable $\omega$. Such a function has a
minimum if and only if $M^* + \lambda I$ is positive. Intuitively, we
can ensure this as long as $\lambda$ is large enough to mitigate the
noise injected in $M^*$.

In general, for any $d > 1$, a reasonably large $\lambda$ makes it
more likely that all eigenvalues of $M^* + \lambda I$ are positive, in
which case $M^* + \lambda I$ would be positive definite. Meanwhile, as
long as $\lambda$ does not overwhelm the signal in $M^*$, it would not
significantly degrade the quality of the solution to the regression
problem. In our experiments, we observe that a good choice of
$\lambda$ equals 4 times the standard deviation of the Laplace noise
added into $M^*$. Note that setting $\lambda$ to this value does not
degrade the privacy guarantee of FM, since the standard deviation of
the Laplace noise does not reveal any information about the original
dataset.
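The choice of $\lambda$ described above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes the noise added to $M^*$ is Laplace with scale $\Delta/\epsilon$, whose standard deviation is $\sqrt{2}\,\Delta/\epsilon$, and the function names and the `multiplier` parameter are hypothetical.

```python
import math


def laplace_std(delta, epsilon):
    """Standard deviation of Laplace noise with scale delta / epsilon."""
    return math.sqrt(2.0) * delta / epsilon


def regularize(m_star, delta, epsilon, multiplier=4.0):
    """Return (M* + lambda * I, lambda) with lambda = multiplier * noise std."""
    lam = multiplier * laplace_std(delta, epsilon)
    d = len(m_star)
    shifted = [[m_star[i][j] + (lam if i == j else 0.0) for j in range(d)]
               for i in range(d)]
    return shifted, lam
```

Because `lam` is computed from `delta` and `epsilon` alone, it can be published without consuming any privacy budget, which matches the remark in the text.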
Although regularization increases the chance of obtaining a
bounded objective function, there is still a certain probability that
the noisy objective function does not have a minimum even after
regularization. This motivates our second approach, spectral
trimming, as will be explained in Section 6.2.

6.2 Spectral Trimming 

Let $\bar{f}_D(\omega) = \omega^T (M^* + \lambda I)\,\omega +
\alpha^*\omega + \beta^*$ be the noisy objective function with
regularization. As we have discussed in Section 6.1, $M^* + \lambda I$
is symmetric (due to the symmetry of $M^*$). In addition,
$\bar{f}_D(\omega)$ is unbounded if and only if $M^* + \lambda I$ is
not positive definite, which holds if and only if at least one
eigenvalue of $M^* + \lambda I$ is not positive [28]. In other words,
to transform an unbounded $\bar{f}_D$ into a bounded one, it suffices
to get rid of the non-positive eigenvalues of $M^* + \lambda I$.

Let $Q^T \Lambda Q$ be the eigen-decomposition of $M^* + \lambda I$,
i.e., $Q$ is a $d \times d$ matrix where each row is an eigenvector of
$M^* + \lambda I$, and $\Lambda$ is a diagonal matrix where the $i$-th
diagonal element is the eigenvalue of $M^* + \lambda I$ corresponding
to the eigenvector in the $i$-th row of $Q$. We have $Q^T Q = I$.
Accordingly,

$$\bar{f}_D(\omega) = \omega^T \left(Q^T \Lambda Q\right) \omega + \alpha^* \left(Q^T Q\right) \omega + \beta^*.$$

Suppose that the $i$-th diagonal element $e_i$ of $\Lambda$ is not
positive. Then, we would remove $e_i$ from $\Lambda$, which results in
a $(d - 1) \times (d - 1)$ diagonal matrix. In addition, we would also
delete the $i$-th row in $Q$, so that $Q^T \Lambda Q$ would still be
well-defined. In general, if $\Lambda$ contains $k$ non-positive
diagonal elements, then removal of all those elements would transform
$\Lambda$ into a $(d - k) \times (d - k)$ matrix, which we denote as
$\Lambda'$. Accordingly, $Q$ becomes a $(d - k) \times d$ matrix,
which we denote as $Q'$. The noisy objective function then becomes

$$\bar{f}_D(\omega) = \omega^T \left(Q'^T \Lambda' Q'\right) \omega + \alpha^* \left(Q'^T Q'\right) \omega + \beta^*. \tag{14}$$

We rewrite $\bar{f}_D(\omega)$ as a function of $Q'\omega$:

$$g_D(Q'\omega) = (Q'\omega)^T \Lambda' (Q'\omega) + \alpha^* Q'^T (Q'\omega) + \beta^*,$$

which is a bounded function of $Q'\omega$ since all eigenvalues of
$\Lambda'$ are positive. We compute the vector $V$ that minimizes
$g_D(V)$, and then derive $\bar{\omega}$ by solving $Q'\omega = V$
(note that the solution to this equation is not unique).

In summary, we delete non-positive elements in $\Lambda$ to obtain a
bounded objective function, based on which we derive the model
parameters. Intuitively, the non-positive elements in $\Lambda$ are
mostly due to noise, and hence, removing them from $\Lambda$ would not
incur significant loss of useful information. Therefore, the objective
function in Equation 14 may still lead to accurate model parameters.
The removal of non-positive elements from $\Lambda$ does not violate
$\epsilon$-differential privacy, as the removing procedure depends
only on $M^*$ (which is differentially private) instead of the input
database.
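Spectral trimming can be sketched for the two-variable case, where the eigen-decomposition of a symmetric matrix has a closed form. The code below is an illustration under stated assumptions, not the authors' implementation: it handles only $d = 2$, the function names are hypothetical, and because $Q'\omega = V$ has many solutions it picks the minimum-norm one, $\omega = Q'^T V$, which is one possible choice among those the text allows.

```python
import math


def eig_sym_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] as (eigenvalue, unit eigenvector) tuples."""
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # rotation that diagonalizes the matrix
    cs, sn = math.cos(theta), math.sin(theta)
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    return [(lam1, (cs, sn)), (lam2, (-sn, cs))]


def spectral_trim_minimizer(m, alpha):
    """Minimize w^T m w + alpha . w after dropping non-positive eigenvalues.

    m is a symmetric 2x2 nested list (playing the role of M* + lambda I) and
    alpha a length-2 list. Returns (w, number of eigenvalues kept).
    """
    pairs = eig_sym_2x2(m[0][0], m[0][1], m[1][1])
    kept = [(lam, q) for lam, q in pairs if lam > 0.0]   # rows of Q' and entries of Lambda'
    w = [0.0, 0.0]
    for lam, q in kept:
        proj = alpha[0] * q[0] + alpha[1] * q[1]   # one entry of alpha* Q'^T
        v = -proj / (2.0 * lam)                    # minimizes lam * v^2 + proj * v
        w[0] += v * q[0]                           # accumulate w = Q'^T V
        w[1] += v * q[1]
    return w, len(kept)
```

When every eigenvalue is positive nothing is trimmed and the result coincides with the ordinary minimizer of the quadratic; when every eigenvalue is non-positive the sketch simply returns the zero vector.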

7. EXPERIMENTS 

This section experimentally evaluates the performance of FM
against four approaches, namely, DPME [16], Filter-Priority (FP)
[7], NoPrivacy, and Truncated. As explained in Section 2, DPME
is the state-of-the-art method for regression analysis under
$\epsilon$-differential privacy, while FP is an
$\epsilon$-differentially private technique for generating synthetic
data that can also be used for regression tasks. NoPrivacy and
Truncated are two algorithms that perform regression analysis but do
not enforce $\epsilon$-differential privacy: NoPrivacy directly
outputs the model parameters that minimize the objective function, and
Truncated returns the parameters obtained from an approximate
objective function with truncated polynomial terms (see Section 5). We
include Truncated in the experiments, so as to investigate the error
incurred by the low-order approximation approach proposed in
Section 5. For DPME and FP, we use the implementations provided by
their respective authors, and we set all internal parameters (e.g.,
the granularity of noisy histograms used by DPME) to their recommended
values. All experiments are conducted using Matlab (version 7.12) on a
computer with a 2.4GHz CPU and 32GB RAM.

Parameter                    Range and Default Value
Data Subset Sampling Rate    0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
Dataset Dimensionality       5, 8, 11, 14
Privacy Budget $\epsilon$    3.2, 1.6, 0.8, 0.4, 0.2, 0.1

Table 2: Experimental parameters and values

We use two datasets from the Integrated Public Use Microdata
Series [22], US and Brazil, which contain 370,000 and 190,000
census records collected in the US and Brazil, respectively. There
are 13 attributes in each dataset, namely, Age, Gender, Marital
Status, Education, Disability, Nativity, Working Hours per Week,
Number of Years Residing in the Current Location, Ownership of
Dwelling, Family Size, Number of Children, Number of Automobiles,
and Annual Income. Among these attributes, Marital Status
is the only categorical attribute whose domain contains more than
2 values, i.e., Single, Married, and Divorced/Widowed. Following
common practice in regression analysis, we transform Marital
Status into two binary attributes, Is Single and Is Married (an
individual divorced or widowed would have false on both of these
attributes). With this transformation, both of our datasets become
14-dimensional.

We conduct regression analysis on each dataset to predict the
value of Annual Income using the remaining attributes. For logistic
regression, we convert Annual Income into a binary attribute: values
higher than a predefined threshold are mapped to 1, and 0 otherwise.
Accordingly, when we use a logistic model to classify a tuple
$t$, we predict the Annual Income of $t$ to be 1 if
$\frac{\exp(x^T \omega)}{1 + \exp(x^T \omega)} > 0.5$
(see Definition 2), where $\omega$ is the model parameter, and $x$ is
a vector that contains the values of $t$ on all attributes except
Annual Income. We measure the accuracy of a logistic model by
its misclassification rate, i.e., the fraction of tuples that are
incorrectly classified. The accuracy of a linear model, on the other
hand, is measured by the mean square error of the predicted values,
i.e., $\frac{1}{n} \sum_{i=1}^{n} \left(y_i - x_i^T \omega\right)^2$,
where $n$ is the number of tuples in the dataset, $y_i$ is the Annual
Income value of the $i$-th tuple, $x_i$ is a vector that contains the
other attribute values of the tuple, and $\omega$ is the model
parameter.
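The two accuracy measures defined above translate directly into code. The sketch below is illustrative only; the function names are hypothetical and tuples are represented as plain Python lists.

```python
import math


def predict_label(x, w):
    """Predict 1 when exp(x.w) / (1 + exp(x.w)) > 0.5, and 0 otherwise."""
    score = sum(xj * wj for xj, wj in zip(x, w))
    # Evaluate the logistic function in a numerically stable way.
    if score >= 0:
        prob = 1.0 / (1.0 + math.exp(-score))
    else:
        prob = math.exp(score) / (1.0 + math.exp(score))
    return 1 if prob > 0.5 else 0


def misclassification_rate(rows, labels, w):
    """Fraction of tuples whose predicted label differs from the true label."""
    wrong = sum(1 for x, y in zip(rows, labels) if predict_label(x, w) != y)
    return wrong / len(rows)


def mean_square_error(rows, targets, w):
    """(1/n) * sum_i (y_i - x_i . w)^2 for a linear model with parameter w."""
    total = 0.0
    for x, y in zip(rows, targets):
        pred = sum(xj * wj for xj, wj in zip(x, w))
        total += (y - pred) ** 2
    return total / len(rows)
```

For both measures, lower values indicate a more accurate model.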

In each experiment, we perform 5-fold cross-validation 50 times
for each algorithm, and we report the average results. We vary
three different parameters, i.e., the dataset size, the dataset
dimensionality, and the privacy budget $\epsilon$. In particular, we
generate random subsets of the tuples in the US and Brazil datasets,
with the sampling rate varying from 0.1 to 1. To vary the dataset
dimensionality, we select three subsets of the attributes in each
dataset for classification. The first subset contains 5 attributes:
Age, Gender, Education, Family Size, and Annual Income. The second
subset consists of 8 attributes: the aforementioned five attributes,
as well as Nativity, Ownership of Dwelling, and Number of Automobiles.
The third subset contains all attributes in the second subset, as well
as Is Single, Is Married, and Number of Children. Table 2 summarizes
the parameter values, with the default values in bold.
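The evaluation protocol of repeating 5-fold cross-validation 50 times and averaging can be sketched as below. The text does not say how the folds were drawn, so this sketch assumes a uniformly random shuffle before each repetition; the function names and the `evaluate` callback (which would train on one index set and return the error on the other) are hypothetical.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def repeated_cross_validation(n, evaluate, k=5, repeats=50, seed=0):
    """Average evaluate(train_idx, test_idx) over repeats * k train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = k_fold_indices(n, k, rng)
        for i, test in enumerate(folds):
            # Training set: every fold except the i-th one.
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

With `k=5` and `repeats=50`, each reported number is the mean of 250 train/test evaluations.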
Figure 4: Regression accuracy vs. dataset dimensionality

(Plots omitted. Legend: NoPrivacy, Truncated, FM, FP, DPME. The
dimensionality plots use x-axis values 5, 8, 11, 14 and include panel
(d) Brazil-Logistic; the y-axis is mean square error for linear
regression and misclassification rate for logistic regression. The
cardinality plots use the sampling rate as x-axis and include panels
(a) US-Linear, (b) Brazil-Linear, and (c) US-Logistic.)

Figure 5: Regression accuracy vs. dataset cardinality
7.1 Accuracy vs. Dataset Dimensionality

Figures 4a and 4b illustrate the linear regression error of each
algorithm as a function of the dataset dimensionality. We omit
Truncated in the figures, as our approximation approach in Section 5
is required only for logistic regression but not linear regression.
Observe that FM consistently outperforms FP and DPME, and its
regression accuracy is almost identical to that of NoPrivacy. In
contrast, FP and DPME incur significant errors, especially when
the dataset dimensionality is large.

Figures 4c and 4d show the error of each algorithm for logistic
regression. The error of Truncated is comparable to that of
NoPrivacy, which demonstrates the effectiveness of our low-order
approximation approach that truncates the polynomial representation
of the objective function. The error of FM is slightly higher than
that of Truncated, but it is still much smaller than the errors of FP
and DPME.

7.2 Accuracy vs. Dataset Cardinality 

Figure 5 shows the regression error of each algorithm as a function
of the dataset cardinality. For both regression tasks and for
both datasets, FM outperforms FP and DPME by considerable margins.
In addition, for linear regression, the difference in accuracy
between FM and NoPrivacy is negligible; meanwhile, their accuracy
remains stable with varying number of records in the database,
except when the sampling rate equals 0.1 (the smallest value used
in all experiments). In contrast, the performance of FP and DPME
improves with the dataset cardinality, which is consistent with the
theoretical results in [7] and [16]. Nevertheless, even when we use
all tuples in the dataset, the accuracy of FP and DPME is still much
worse than that of FM and NoPrivacy.

For logistic regression, there is a gap between the accuracy of
FM and that of NoPrivacy and Truncated, but the gap shrinks
rapidly with the increase of dataset cardinality. The errors of FP
and DPME also decrease when the dataset cardinality increases,
but they remain considerably higher than the error of FM in all
cases.

7.3 Accuracy vs. Privacy Budget 

Figure 6 plots the regression error of each algorithm as a function
of the privacy budget $\epsilon$. The errors of NoPrivacy and
Truncated remain unchanged for all $\epsilon$, as neither of them
enforces $\epsilon$-differential privacy. All of the other three
methods incur higher errors when $\epsilon$ decreases, as a smaller
$\epsilon$ requires a larger amount of noise to be injected. FM
outperforms FP and DPME in all cases, and it is relatively robust
against the change of $\epsilon$. In contrast, FP and DPME produce
much less accurate regression results, especially when $\epsilon$ is
small.

7.4 Computation Time 

Finally, Figures 7-9 report the average running time of each
algorithm. Due to the space constraint, we only report the results for
logistic regression; the results for linear regression are
qualitatively similar. Overall, the running time of FM is at least one
order of magnitude lower than that of NoPrivacy, which in turn is
about two times faster than FP and DPME. The efficiency of FM is
mainly due to its low-order approximation module, which truncates
the polynomial representation of the objective function and retains
only the first and second order terms. As a consequence, FM computes
the optimization results by solving a multivariate quadratic
optimization problem, for which Matlab has an efficient solution.
In contrast, all other methods require solving the original
optimization problem of logistic regression, which has a complicated
objective function that renders the solving process time-consuming.
In addition, FP and DPME require additional time to generate
synthetic data, leading to even higher computation cost.
Figure 7: Computation time vs. dataset dimensionality on logistic
regression

Figure 8: Computation time vs. dataset cardinality on logistic
regression



As shown in Figure 7 and Figure 8, the computation time of all
algorithms increases with the dimensionality and cardinality of the
dataset. This is expected, as a larger number of tuples (attributes)
leads to a higher complexity of the optimization problem. The
execution time of FP and DPME increases at a faster rate than that
of FM and NoPrivacy, since the former two require generating
synthetic data, which entails higher computation cost when the number
of tuples (attributes) in the dataset increases. On the other hand, as
shown in Figure 9, the privacy budget $\epsilon$ has negligible
effects on the running time of the algorithms, since it affects
neither the size nor the dimensionality of the dataset, nor the
complexity of the optimization problem being solved.

In summary, FM is superior to FP and DPME in terms of both 
accuracy and efficiency in all experiments. The advantage of FM 
in terms of regression accuracy is more pronounced when the data 
dimensionality increases. Furthermore, the accuracy of FM is even 
comparable to NoPrivacy in the scenarios where the cardinality of 
the dataset or the privacy budget is reasonably large. These results 
demonstrate that FM is a preferable method for differentially pri- 
vate regression analysis. 

Figure 9: Computation time vs. privacy budget on logistic
regression

8. CONCLUSION AND FUTURE WORK

This paper presents a general approach for differentially private
regression. Different from existing techniques, our approach conducts
both sensitivity analysis and noise insertion on the objective
functions, which leads to more accurate regression results when the
objective functions can be represented as finite polynomials. To
tackle more complex objective functions with infinite polynomial
representation (e.g., logistic regression), we propose to truncate the
Taylor expansion of the objective function, and we analyze the error
incurred in the optimization results. Our empirical studies on
real datasets validate our theoretical results and demonstrate the
effectiveness and efficiency of our proposal.

For future work, we plan to extend our research in the following
directions. First, our current mechanism only works with objective
functions in the form of $\sum_{i=1}^{n} f(t_i, \omega)$. However,
there exist regression tasks with more complicated objective functions
(e.g., Cox regression). It is interesting and challenging to
investigate how those regression tasks can be addressed. Second,
besides Taylor expansion, there may exist other analytical tools that
can be used to approximate the objective functions. We plan to study
whether alternative analytical tools can lead to more accurate
regression results.